In this assignment I am given the task of determining whether daily concentration of PM2.5 has decreased in California in the last 15 years. To do this I am using data provided by the U.S. EPA.


Problem 1. Read in the data. Check dimensions, headers, footers, variable names, and variable types. Check for any data issues, particularly in the key variables. Summarize all findings.

I read in the data using data.table():

pm2004 <- data.table::fread("ad_viz_plotval_data_2004.csv")
pm2004 <- clean_names(pm2004)
pm2019 <- data.table::fread("ad_viz_plotval_data_2019.csv")
pm2019 <- clean_names(pm2019)

I then check dimensions, headers, footers, variable names, variable types, and summary statistics. I check for missing values, and I plot histograms to visually inspect the data.

# Check dimensions
paste("The 2004 dataset has", dim(pm2004)[1], "observations and", dim(pm2004)[2], "variables.")
## [1] "The 2004 dataset has 19233 observations and 20 variables."
paste("The 2019 dataset has", dim(pm2019)[1], "observations and", dim(pm2019)[2], "variables.")
## [1] "The 2019 dataset has 53086 observations and 20 variables."
# Check headers
head(pm2004, n=3)
##          date source  site_id poc daily_mean_pm2_5_concentration    units
## 1: 01/01/2004    AQS 60010007   1                           11.0 ug/m3 LC
## 2: 01/02/2004    AQS 60010007   1                           12.2 ug/m3 LC
## 3: 01/03/2004    AQS 60010007   1                           16.5 ug/m3 LC
##    daily_aqi_value site_name daily_obs_count percent_complete
## 1:              46 Livermore               1              100
## 2:              51 Livermore               1              100
## 3:              60 Livermore               1              100
##    aqs_parameter_code                     aqs_parameter_desc cbsa_code
## 1:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 2:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 3:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
##                            cbsa_name state_code      state county_code  county
## 1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
##    site_latitude site_longitude
## 1:      37.68753      -121.7842
## 2:      37.68753      -121.7842
## 3:      37.68753      -121.7842
head(pm2019, n=3)
##          date source  site_id poc daily_mean_pm2_5_concentration    units
## 1: 01/01/2019    AQS 60010007   3                            5.7 ug/m3 LC
## 2: 01/02/2019    AQS 60010007   3                           11.9 ug/m3 LC
## 3: 01/03/2019    AQS 60010007   3                           20.1 ug/m3 LC
##    daily_aqi_value site_name daily_obs_count percent_complete
## 1:              24 Livermore               1              100
## 2:              50 Livermore               1              100
## 3:              68 Livermore               1              100
##    aqs_parameter_code       aqs_parameter_desc cbsa_code
## 1:              88101 PM2.5 - Local Conditions     41860
## 2:              88101 PM2.5 - Local Conditions     41860
## 3:              88101 PM2.5 - Local Conditions     41860
##                            cbsa_name state_code      state county_code  county
## 1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
##    site_latitude site_longitude
## 1:      37.68753      -121.7842
## 2:      37.68753      -121.7842
## 3:      37.68753      -121.7842
# Check footers
tail(pm2004, n=3)
##          date source  site_id poc daily_mean_pm2_5_concentration    units
## 1: 12/23/2004    AQS 61131003   1                              9 ug/m3 LC
## 2: 12/26/2004    AQS 61131003   1                             24 ug/m3 LC
## 3: 12/29/2004    AQS 61131003   1                              9 ug/m3 LC
##    daily_aqi_value            site_name daily_obs_count percent_complete
## 1:              38 Woodland-Gibson Road               1              100
## 2:              76 Woodland-Gibson Road               1              100
## 3:              38 Woodland-Gibson Road               1              100
##    aqs_parameter_code       aqs_parameter_desc cbsa_code
## 1:              88101 PM2.5 - Local Conditions     40900
## 2:              88101 PM2.5 - Local Conditions     40900
## 3:              88101 PM2.5 - Local Conditions     40900
##                                  cbsa_name state_code      state county_code
## 1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
## 2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
## 3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
##    county site_latitude site_longitude
## 1:   Yolo      38.66121      -121.7327
## 2:   Yolo      38.66121      -121.7327
## 3:   Yolo      38.66121      -121.7327
tail(pm2019, n=3)
##          date source  site_id poc daily_mean_pm2_5_concentration    units
## 1: 12/17/2019    AQS 61131003   1                           23.8 ug/m3 LC
## 2: 12/23/2019    AQS 61131003   1                            1.0 ug/m3 LC
## 3: 12/29/2019    AQS 61131003   1                            9.1 ug/m3 LC
##    daily_aqi_value            site_name daily_obs_count percent_complete
## 1:              76 Woodland-Gibson Road               1              100
## 2:               4 Woodland-Gibson Road               1              100
## 3:              38 Woodland-Gibson Road               1              100
##    aqs_parameter_code       aqs_parameter_desc cbsa_code
## 1:              88101 PM2.5 - Local Conditions     40900
## 2:              88101 PM2.5 - Local Conditions     40900
## 3:              88101 PM2.5 - Local Conditions     40900
##                                  cbsa_name state_code      state county_code
## 1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
## 2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
## 3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
##    county site_latitude site_longitude
## 1:   Yolo      38.66121      -121.7327
## 2:   Yolo      38.66121      -121.7327
## 3:   Yolo      38.66121      -121.7327
# Check variable names
paste(colnames(pm2004))
##  [1] "date"                           "source"                        
##  [3] "site_id"                        "poc"                           
##  [5] "daily_mean_pm2_5_concentration" "units"                         
##  [7] "daily_aqi_value"                "site_name"                     
##  [9] "daily_obs_count"                "percent_complete"              
## [11] "aqs_parameter_code"             "aqs_parameter_desc"            
## [13] "cbsa_code"                      "cbsa_name"                     
## [15] "state_code"                     "state"                         
## [17] "county_code"                    "county"                        
## [19] "site_latitude"                  "site_longitude"
# Check variable types
sapply(pm2004, typeof)
##                           date                         source 
##                    "character"                    "character" 
##                        site_id                            poc 
##                      "integer"                      "integer" 
## daily_mean_pm2_5_concentration                          units 
##                       "double"                    "character" 
##                daily_aqi_value                      site_name 
##                      "integer"                    "character" 
##                daily_obs_count               percent_complete 
##                      "integer"                       "double" 
##             aqs_parameter_code             aqs_parameter_desc 
##                      "integer"                    "character" 
##                      cbsa_code                      cbsa_name 
##                      "integer"                    "character" 
##                     state_code                          state 
##                      "integer"                    "character" 
##                    county_code                         county 
##                      "integer"                    "character" 
##                  site_latitude                 site_longitude 
##                       "double"                       "double"
# Check for issues in key variables:
str(pm2004) # General readout
## Classes 'data.table' and 'data.frame':   19233 obs. of  20 variables:
##  $ date                          : chr  "01/01/2004" "01/02/2004" "01/03/2004" "01/04/2004" ...
##  $ source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
##  $ site_id                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
##  $ poc                           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ daily_mean_pm2_5_concentration: num  11 12.2 16.5 19.5 11.5 32.5 15.5 29.9 21 16.9 ...
##  $ units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
##  $ daily_aqi_value               : int  46 51 60 67 48 94 58 88 70 61 ...
##  $ site_name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
##  $ daily_obs_count               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ percent_complete              : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ aqs_parameter_code            : int  88502 88502 88502 88502 88502 88502 88502 88502 88502 88502 ...
##  $ aqs_parameter_desc            : chr  "Acceptable PM2.5 AQI & Speciation Mass" "Acceptable PM2.5 AQI & Speciation Mass" "Acceptable PM2.5 AQI & Speciation Mass" "Acceptable PM2.5 AQI & Speciation Mass" ...
##  $ cbsa_code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
##  $ cbsa_name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
##  $ state_code                    : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ state                         : chr  "California" "California" "California" "California" ...
##  $ county_code                   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ county                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ site_latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ site_longitude                : num  -122 -122 -122 -122 -122 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(pm2004) #Summary statistics
##      date              source             site_id              poc        
##  Length:19233       Length:19233       Min.   :60010007   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:60370002   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median :60658001   Median : 1.000  
##                                        Mean   :60588026   Mean   : 1.816  
##                                        3rd Qu.:60750006   3rd Qu.: 2.000  
##                                        Max.   :61131003   Max.   :12.000  
##                                                                           
##  daily_mean_pm2_5_concentration    units           daily_aqi_value 
##  Min.   : -0.10                 Length:19233       Min.   :  0.00  
##  1st Qu.:  6.00                 Class :character   1st Qu.: 25.00  
##  Median : 10.10                 Mode  :character   Median : 42.00  
##  Mean   : 13.13                                    Mean   : 46.33  
##  3rd Qu.: 16.30                                    3rd Qu.: 60.00  
##  Max.   :251.00                                    Max.   :301.00  
##                                                                    
##   site_name         daily_obs_count percent_complete aqs_parameter_code
##  Length:19233       Min.   :1       Min.   :100      Min.   :88101     
##  Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
##  Mode  :character   Median :1       Median :100      Median :88101     
##                     Mean   :1       Mean   :100      Mean   :88267     
##                     3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
##                     Max.   :1       Max.   :100      Max.   :88502     
##                                                                        
##  aqs_parameter_desc   cbsa_code      cbsa_name           state_code
##  Length:19233       Min.   :12540   Length:19233       Min.   :6   
##  Class :character   1st Qu.:31080   Class :character   1st Qu.:6   
##  Mode  :character   Median :40140   Mode  :character   Median :6   
##                     Mean   :35328                      Mean   :6   
##                     3rd Qu.:41860                      3rd Qu.:6   
##                     Max.   :49700                      Max.   :6   
##                     NA's   :1253                                   
##     state            county_code        county          site_latitude  
##  Length:19233       Min.   :  1.00   Length:19233       Min.   :32.63  
##  Class :character   1st Qu.: 37.00   Class :character   1st Qu.:34.07  
##  Mode  :character   Median : 65.00   Mode  :character   Median :36.48  
##                     Mean   : 58.63                      Mean   :36.23  
##                     3rd Qu.: 75.00                      3rd Qu.:38.10  
##                     Max.   :113.00                      Max.   :41.71  
##                                                                        
##  site_longitude  
##  Min.   :-124.2  
##  1st Qu.:-121.6  
##  Median :-119.3  
##  Mean   :-119.7  
##  3rd Qu.:-117.9  
##  Max.   :-115.5  
## 
summary(pm2004$daily_mean_pm2_5_concentration) #Key variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.10    6.00   10.10   13.13   16.30  251.00
mean(is.na(pm2004$daily_mean_pm2_5_concentration)) #No missing values
## [1] 0
hist(pm2004$daily_mean_pm2_5_concentration, breaks = 100, main = "Histogram of all 2004 PM 2.5 data", xlab = "PM 2.5", xlim=c(0,100)) #2004 histogram

str(pm2019) # General readout
## Classes 'data.table' and 'data.frame':   53086 obs. of  20 variables:
##  $ date                          : chr  "01/01/2019" "01/02/2019" "01/03/2019" "01/04/2019" ...
##  $ source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
##  $ site_id                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
##  $ poc                           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ daily_mean_pm2_5_concentration: num  5.7 11.9 20.1 28.8 11.2 2.7 2.8 7 3.1 7.1 ...
##  $ units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
##  $ daily_aqi_value               : int  24 50 68 86 47 11 12 29 13 30 ...
##  $ site_name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
##  $ daily_obs_count               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ percent_complete              : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ aqs_parameter_code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
##  $ aqs_parameter_desc            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
##  $ cbsa_code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
##  $ cbsa_name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
##  $ state_code                    : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ state                         : chr  "California" "California" "California" "California" ...
##  $ county_code                   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ county                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ site_latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ site_longitude                : num  -122 -122 -122 -122 -122 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(pm2019) #Summary statistics
##      date              source             site_id              poc        
##  Length:53086       Length:53086       Min.   :60010007   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:60310004   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median :60612003   Median : 3.000  
##                                        Mean   :60565291   Mean   : 2.562  
##                                        3rd Qu.:60771002   3rd Qu.: 3.000  
##                                        Max.   :61131003   Max.   :21.000  
##                                                                           
##  daily_mean_pm2_5_concentration    units           daily_aqi_value 
##  Min.   : -2.200                Length:53086       Min.   :  0.00  
##  1st Qu.:  4.000                Class :character   1st Qu.: 17.00  
##  Median :  6.500                Mode  :character   Median : 27.00  
##  Mean   :  7.734                                   Mean   : 30.56  
##  3rd Qu.:  9.900                                   3rd Qu.: 41.00  
##  Max.   :120.900                                   Max.   :185.00  
##                                                                    
##   site_name         daily_obs_count percent_complete aqs_parameter_code
##  Length:53086       Min.   :1       Min.   :100      Min.   :88101     
##  Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
##  Mode  :character   Median :1       Median :100      Median :88101     
##                     Mean   :1       Mean   :100      Mean   :88214     
##                     3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
##                     Max.   :1       Max.   :100      Max.   :88502     
##                                                                        
##  aqs_parameter_desc   cbsa_code      cbsa_name           state_code
##  Length:53086       Min.   :12540   Length:53086       Min.   :6   
##  Class :character   1st Qu.:31080   Class :character   1st Qu.:6   
##  Mode  :character   Median :40140   Mode  :character   Median :6   
##                     Mean   :35841                      Mean   :6   
##                     3rd Qu.:41860                      3rd Qu.:6   
##                     Max.   :49700                      Max.   :6   
##                     NA's   :4181                                   
##     state            county_code        county          site_latitude  
##  Length:53086       Min.   :  1.00   Length:53086       Min.   :32.58  
##  Class :character   1st Qu.: 31.00   Class :character   1st Qu.:34.14  
##  Mode  :character   Median : 61.00   Mode  :character   Median :36.63  
##                     Mean   : 56.39                      Mean   :36.35  
##                     3rd Qu.: 77.00                      3rd Qu.:37.97  
##                     Max.   :113.00                      Max.   :41.76  
##                                                                        
##  site_longitude  
##  Min.   :-124.2  
##  1st Qu.:-121.6  
##  Median :-119.8  
##  Mean   :-119.8  
##  3rd Qu.:-118.1  
##  Max.   :-115.5  
## 
summary(pm2019$daily_mean_pm2_5_concentration) #Key variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.200   4.000   6.500   7.734   9.900 120.900
mean(is.na(pm2019$daily_mean_pm2_5_concentration)) #No missing values
## [1] 0
hist(pm2019$daily_mean_pm2_5_concentration, breaks = 100, main = "Histogram of all 2019 PM 2.5 data", xlab = "PM 2.5", xlim=c(0,100)) #2019 histogram

Summary of findings: The 2019 data is significantly larger than the 2004 data. The max values of PM2.5 are very high compared to the mean. This will be investigated later to see whether it’s an error or a natural occurrence.


Problem 2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.

# Combine the two years of data into one data frame.
pm <- rbind(pm2004, pm2019)

# Use the Date variable to create a new column for year, which will serve as an identifier.
dates <- pm$date
dates2 <- as.POSIXct(dates, format = "%m/%d/%Y")
dates3 <- format(dates2, format="%Y") 

pm <-  mutate(pm, year = dates3) #New column "year" added

# Change the names of the key variables so that they are easier to refer to in your code.
pm <- rename(pm, pm25 = daily_mean_pm2_5_concentration)

pm$year <- as.numeric(pm$year) #so that colorNumeric works


Problem 3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

# Creating a basic map in leaflet() that shows the locations of the sites.
pal <- colorNumeric(palette = "RdYlBu", domain=c(2004, 2019)) 

leaflet(pm) %>%
  addProviderTiles('OpenStreetMap') %>% 
  addCircles(lat=~site_latitude,lng=~site_longitude, color=pal(pm$year), opacity=1, fillOpacity=1, radius=100)
  #addLegend('bottomleft', pal=pal, values=~c(2004, 2019),
          #title='Site Locations', opacity=1)

As seen in this map, both years have well distributed sites, but 2019 (blue) has considerably more sites. The sites tend to be concentrated along the coast, and by the large cities. This makes sense, as high PM2.5 values are associated with combustion of gasoline, so these areas require greater scrutiny. (I couldn’t get the leaflet “addLegend” functionality to work)

Problem 4. Check for any missing or implausible values of PM2.5 in the combined data set. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

I check for NA values in the entire data set, and find which variable they belong to:

mean(is.na(pm))
## [1] 0.003578063
mean(is.na(pm$cbsa_code))
## [1] 0.07513931
mean(is.na(pm$cbsa_code))/mean(is.na(pm))
## [1] 21

0.36% of all data is missing, and 7.5% of the “cbsa_code” data is missing. As 7.5 is 21x greater than 0.36, and there are 21 variables, that means all missing data is from the “cbsa_code” column.

Now I investigate extreme values in the data set to determine whether they’re errors or natural outliers. I start with Daily AQI values:

# Extreme values
boxplot(pm$daily_aqi_value, main = "Boxplot of all Daily AQI Values")

summary(pm$daily_aqi_value) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   18.00   30.00   34.75   47.00  301.00

Here we see an extreme outlier of AQI = 301. According to airnow.gov, an AQI (air quality index) of 301 or higher represents an emergency condition. This value is most associated with wildfires, which is not unreasonable. I now consider PM2.5 values:

boxplot(pm$pm25, main = "Boxplot of all PM 2.5 Values")

summary(pm$pm25) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.200   4.400   7.200   9.168  11.300 251.000

PM 2.5 has an extremely high Max value of 251, perhaps this is associated with the same event observed in the high AQI event discovered above?

pm[which(grepl(251, pm$pm25)), daily_aqi_value]
## [1] 301

The extreme PM2.5 and AQI values did indeed come from the same reading, suggesting they are both accurate readings of an extreme weather event. Investigating further:

# at which site is this supposed wildfire happening?
wildfire_site_index = which(grepl(251, pm$pm25))
wildfire_site_id = pm[wildfire_site_index,site_id]
#what year?
wildfire_year = pm[wildfire_site_index,year]

Plotting PM2.5 at the site in question to see if it’s insightful:

pmfire <- filter(pm, site_id == wildfire_site_id, year == wildfire_year)
ggplot(data = pmfire, aes(x=date, y=pm25)) +
  geom_point() + ggtitle("PM 2.5 at site 60431001 in 2004") + theme(axis.text.x = element_blank())

That graph pretty clearly tells the story of a natural extreme event, likely a wildfire. Just to confirm, I plot the Daily AQI values:

ggplot(data = pmfire, aes(x=date, y=daily_aqi_value)) +
  geom_point() + ggtitle("Daily AQI values at site 60431001 in 2004") + theme(axis.text.x = element_blank())

The plot of Daily AQI values at this site confirms the story.

Moving on, I check if the sites with the highest average PM2.5 values seem plausible:

# calc ave PM2.5 by site
pm_ave <- pm %>% group_by(site_id) %>% summarize(mean25 = mean(pm25))
# max(pm_ave$mean25) # find the highest average pm2.5 value (unused)
#which index of pm_ave has this max average value?
high_site_index = which(pm_ave$mean25 %in% max(pm_ave$mean25))
#what is the site id associated with this max value?
high_site = pm_ave$site_id[high_site_index]
#let's graph pm25 at this site in 2004
pmhigh_2004 <- filter(pm, site_id == high_site, year == 2004)
ggplot(data = pmhigh_2004, aes(x=date, y=pm25)) +
  geom_point() + ggtitle("PM 2.5 at site with highest average PM 2.5") + theme(axis.text.x = element_blank())

This sparse data wasn’t very revealing, let’s try the second highest value:

next_highest = max( pm_ave$mean25[pm_ave$mean25!=max(pm_ave$mean25)])
nexthighest_index = which(pm_ave$mean25 %in% next_highest)
nexthighest_site = pm_ave$site_id[nexthighest_index]
pmnexthigh_2004 <- filter(pm, site_id == nexthighest_site, year == 2004)
ggplot(data = pmnexthigh_2004, aes(x=date, y=pm25)) +
  geom_point() + ggtitle("PM 2.5 at site with second highest average PM 2.5") + theme(axis.text.x = element_blank())

This was done to reveal whether the outliers in the data set followed a discernible pattern or were random. I found these high average PM2.5 sites to have credible data - but they don’t suggest a wildfire like the previous investigation did.


Problem 5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

Much of the code below was inspired by Chapter 16 of Exploratory Data Analysis by Roger Peng.


State-wide analysis:

#Boxplots of 2004 and 2019 PM25 for the entire state
ggplot(data = pm, aes(y=pm25)) +
  geom_boxplot()+
  facet_wrap(~year)+
  coord_cartesian(ylim = c(0,150)) + 
  ggtitle("PM 2.5 at all sites in California, per year")

#Summary of 2004 and 2019 PM2.5 statistics for the entire state
with(pm, tapply(pm25, year, summary))
## $`2004`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.10    6.00   10.10   13.13   16.30  251.00 
## 
## $`2019`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.200   4.000   6.500   7.734   9.900 120.900

It is clear from these box plots and statistical summaries that 2019 has a lower mean, smaller spread, and fewer extreme outliers in PM2.5 than did 2004.

For the following county-wide analysis, I’ve chosen Los Angeles county, as it is my home:

#Filtering down the entire data set to just one county
LA25 <- filter(pm, county == "Los Angeles")

#Box plot of PM2.5 in only Los Angeles County
ggplot(data = LA25, aes(y=pm25)) +
  geom_boxplot() +
  facet_wrap(~year) +
  ggtitle("PM 2.5 at all sites in Los Angeles County, per year")

with(LA25, tapply(pm25, year, summary))
## $`2004`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1    10.5    14.7    17.1    20.4    75.6 
## 
## $`2019`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.50    6.40    9.50   10.17   12.90  120.90

Now I will investigate a specific site within Los Angeles county. First, I find a site which was monitoring during both years. Then I count the total readings at each site, to ensure the site I choose has plenty of data.

##Making a list of all unique site IDs in LA county
sites <- filter(pm, county == "Los Angeles") %>% select(site_id, year) %>% unique
site.year <- with(sites, split(site_id, year))
both <- intersect(site.year[[1]], site.year[[2]])

#Making a data frame with only sites from LA county in use in both years
count <- mutate(pm) %>% filter(site_id %in% both)

#Getting a count of readings at each site to ensure the one we choose is interesting
group_by(count, site_id) %>% summarize(n = n())
## # A tibble: 8 x 2
##    site_id     n
##      <int> <int>
## 1 60370002   400
## 2 60371103  1401
## 3 60371201   586
## 4 60372005   288
## 5 60374002   484
## 6 60374004  1043
## 7 60379033   471
## 8 60379034   219

I continue on my analysis of the site with the most readings: site_id 60371103 (North Main Street, my stomping grounds!)

mainst25 <- filter(pm, site_id == 60371103) %>% select(date, year, pm25) %>% mutate(date2 = as.Date(date, format="%m/%d/%Y"), yday = yday(date2))

#
qplot(yday, pm25, data=mainst25, facets = . ~ year, main = "PM 2.5 on Main St. in Los Angeles, 2004 and 2019", xlab = "Day of the year")

This was a very illuminating visualization! It is clear how much PM 2.5 has dropped in this 15 year period. I feel confident breathing deep as I write this report.

Now I will compare the average PM 2.5 value at each site in the state across the years. I will identify all of the sites which were monitoring both years. I will then calculate the mean of PM2.5 for each site in both years, and then I’ll plot a line between the two values to easily see whether most sites dropped between the years.

#Making a data frame with only the sites which were present at each year
sites2 <- filter(pm) %>% select(site_id, year) %>% unique
sites2.year <- with(sites2, split(site_id, year))
both2 <- intersect(sites2.year[[1]], sites2.year[[2]])
count2 <- mutate(pm) %>% filter(site_id %in% both2) #this is the final DF

#calculate the mean of PM 2.5 for each site in both years
mn <- group_by(count2, year, site_id) %>% summarize(pm25_mean = mean(pm25, na.rm = TRUE))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
#plot
qplot(Change_over_years, pm25_mean, data = mutate(mn, Change_over_years = as.numeric(as.character(year))), color = factor(site_id), geom = c("point", "line"), main = "Average PM2.5 at each site between 2004-2019") + theme(legend.position="none")

It is clear that the vast majority of sites reduced their average PM2.5 value between the years of 2004 and 2019.